1. Fundamentals

1.1. Linear Independence

Definition 1.1.1

A set of vectors is linearly independent iff

Definition 1.1.2
A basis for a set of vectors is any set of vectors where linear combinations of vectors in can yield every vector in .

1.2. Orthogonality

Theorem 1.2.1
A set of mutually orthogonal vectors is a set of linearly independent vectors.

Proof: Define the basis to be . Then we have

Now suppose the basis was linearly dependent. Then, there must exist such that

Now we have

Hence, we’ve arrived at a contradiction, and our basis must be linearly independent.

Theorem 1.2.2
The product of orthonormal matrices (columns are orthogonal and unit norm) is orthonormal.

Proof:

1.3. Rank and Spaces

Definition 1.3.3
The space spanned by the columns of a matrix is called the column space and similarly the space spanned by the rows is the row space. The rank of a space is the size of the smallest basis needed to span it.
Lemma 1.3.1
Let . If is invertible, the columns of are linearly independent iff the columns of are linearly independent.

Proof: Suppose s.t.

Then left-multiplying by immediately reveals…

The reverse direction is shown by using .

Lemma 1.3.2
Let . If is invertible,

Proof: Suppose has column-rank , e.g it has linearly independent vectors. The previous lemma implies that has at least linearly independent vectors. Thus . The reverse is also true, since we can apply the same argument in reverse by saying . This forces equality.

Theorem 1.3.3
The row-rank is equal to the column-rank.

Proof: Since all standard row operations (swapping, scaling, adding) can be represented as an invertible matrix applied on the left, we can row-reduce the matrix until it is in reduced row-echelon form.

We can do a similar set of transforms with columns by multiplying by an invertible matrix on the right, allowing us to arrive at a matrix with at most a single one per column or row. From here, it is clear the row-rank is equal to the column-rank.

This has some interesting implications, like .

Theorem 1.3.4

Proof: Suppose we have .

Furthermore, it’s easy to see that every vector orthgonal to is in . Hence, we can find an orthonormal basis for and then extend it to a full basis for via vectors in . Thus, .

The argument is identical for rowspace except using rows instead of columns.

Theorem 1.3.5

Proof: Imagine we have and . Then, . But this implies by the above theormem. Hence, every non-zero vector out of gets mapped to a unique vector out of .

1.4. Trace

Theorem 1.4.6
The trace is invariant to cyclic permutations.

Proof:

Furthermore,

So the trace of a matrix is the sum of its eigenvalues, or more generally the trace of is the sum of singular values squared.

1.5. Triangular Matrices

Theorem 1.5.7
The determinant of a triangular matrix is the product of its diagonal entries.

Proof: The determinant is invariant to column addition (think about a parallelogram), hence you can just zero out all the entries not in the diagonals, and the claim becomes plainly true.

Theorem 1.5.8
The Triangular matrix equation can be solved in time.

Proof: Use back-substitution.

1.6. Similar Matrices

Definition 1.6.4
and are similar if for some matrix

Similar matrices have a lot of nice properties.

Using Taylor expansions…

Crucially, every square matric is similar to an upper-triangular matrix. You can see this with the Schur or Jordan Decompositions. By the two properties above, we have

This leads to the Spectral Mapping Theorem, which states that the eigenvalues of are . To see the Spectral Mapping theorem, observe that for a triangular matrix its determinant is equal to the product of its diagonal entries.

This immediately implies the Cayley-Hamilton theorem, which states that every matrix satisfies its own characteristic equation.

1.7. Symmetric Matrices

Theorem 1.7.9
All eigenvalues of a symmetric real matrix are real.

Proof:

Theorem 1.7.10
For a symmetric matrix, all eigenvectors with different eigenvalues are orthogonal.

Proof:

Theorem 1.7.11
Symmetric Matrices are diagonalizable.

Proof: We know all the eigenvalues of the symmetric matrix are real. Hence, has some real eigenvector corresponding to some real eigenvalue, call this . Extend to an orthonormal basis for . Let be the orthonormal matrix whose columns are . Now, define as

Observe is symmetric.

Hence, is symmetric. Furthermore,

Combining the two properties, looks like this.

Because is symmetric its principal submatrix is also symmetric. Observe that any eigenvector for can produce an eigenvector for , . We can use the first column of and to orthogonalize , and we can keep doing this until we arrive at which is completely diagonal.

Where is a diagonal matrix consisting of the eigenvalues of and is the product of the orthonormal matrices. Since the product of orthonormal matrices is also orthonormal, and thus is diagonalizable.

1.8. Projection

Definition 1.8.5
A projector is a matrix which satisfies .
Lemma 1.8.3
If is a projector, then is also a projector, which projects onto the null space of .

Proof:

Theorem 1.8.12

Let and . Then the unique solution to

is .

Proof: The choice of and above clearly produces a valid solution. However, perhaps there is some vector , such that , and , which yields another solution? This requires . As , the only vector which satisfies this is zero.

The above theorem is probably the single most important fact about projectors. Projectors yield a unique decomposition of vectors via a portion within a subspace and a portion outside of it. One particularly useful way to break down a vector is into a portion within a subspace and a portion orthogonal to it.

Definition 1.8.6
A projector whose null space is orthogonal to its range is an orthogonal projector.
Theorem 1.8.13
Orthogonal projectors must satisfy .

Proof: For the forward direction we have

For the backwards, let be a basis for and let be a basis for . Since we are assuming is an orthogonal projector, these two sets of vectors are mutually independent and thus form a basis for . Define the th column of to be . We immediately have,

And thus,

Hence, we’ve shown , and it’s easy to see that .

From the last proof, we also can say any orthogonal projector can be written as where . You can check that the reverse also holds. If I have some subspace defined the orthonormal basis encoded in , then defines a projector onto that subspace.

You can also define an orthogonal projector when given a non-orthogonal basis . Let be in the input vector and let be the projection. Since , let . Then

So we get the new projector, .

Orthogonal projectors are nice because they immediately tell you the closest point in a subspace. Suppose I have a vector and I have a the closest point in the subspace denoted . Then

Where . Hence, orthogonal projection is the canonical solution to least squares problems, e.g. find to minimize .

1.9. Generalized Inverse

Suppose we want to solve . If is not a square, full-rank matrix, there is no such that , which essentially means a solution does not exist for all . However, there still might be a solution to the above system if just happens to lie in the column space of . This motivates the following definition.

Moore-Penrose-PseudoInverse
Definition 1.9.7

Let be the matrix which satisfies

  1. is an orthogonal projector onto the column space of .
  2. is an orthogonal projector onto the row space of .

To see why this is useful, let’s plug these definitions into our problem.

Suppose .

So if we find to produce the closest vector, and if we find the smallest that produces it. The motivation for the second condition is it’s a natural tie-breaker when there are multiple solutions.

From these definitions, it’s easy to see the generalized inverse is satisfied by the following decomposition.

1.10. Norms

Definition 1.10.8

The operator norm of a matrix is the largest amount it can scale any vector.

The specific value of the operator norm depends on how you measure “large”. For example, you could measure how much the L2-Norm scales. You can even use different input and output norms. For example,

Theorem 1.10.14

Proof: Observe that can be written as using the SVD. Since and are orthonormal, controls the scaling. Hence, the operator norm is equal to the largest entry of .

Definition 1.10.9

The Frobenius norm of a matrix is the L2-norm of its singular value vector.

This is also equivalent to the sum of the matrice’s entries squared (just look at the SVD and observe orthonormal matrices don’t change vector norms). This is algebraically powerful because you can express it in terms of traces.

Definition 1.10.10

The Nuclear or Trace norm is the L1-norm of its singular value vector.

Traces have some very nice algebraic properties, so this is probably the easiest norm from an algebraic standpoint. Furthermore, the Nuclear norm has the nice property that the corresponding norm ball has its extreme points at rank 1 matrices. Hence, it shows up in convex optimization contexts when you want low-rank results.

1.11. Definite and Indefinite

Definition 1.11.11

A matrix is positive-definite if

You can similarly definite negative definite and the semi variants which allow . If it doesn’t fit into any of these categories, it’s called indefinite.

The best way to think about these matrices is in terms of the object. This object is literally a quadratic equation in high dimensions. For a definite matrix, all choices of decrease , or all directions increase . Hence, you have a nice bowl shaped quadratic and this makes optimization easy. On the other hand, if some directions move you up and some move you down, you end up with a saddle. This can mess up gradient-descent based optimization methods.

Theorem 1.11.15
Positive definite matrices have positive diagonal entries.

Proof:

1.12. Companion Matrix

A companion matrix is just a construction whose eigenvalue equation is exactly some polynomial. For the polynomial . It can be constructed as follows.

This is useful to know because it means every root finding problem can be formulated as an eigenvalue problem. Since there is no formula for polynomials of degree five and greater, it implies there is no finite sequence of operations which can diagonalize an arbitrary matrix.

1.13. Stability

Forward-Stable
An algorithm that gives almost the right answer (within epsilon relative error) to almost the right problem (inputs within epsilon relative error).
Backward-Stable
An algorithm which can be interpreted to give the exact right answer to a perturbed problem (within epsilon relative error of the original problem).

Inner products are forward and backward stable. Outer products are not backwards stable, because you cannot usually interpret the output matrix as the outer product of two perturbed input vectors (the perturbation to the output matrix doesn’t factor).

If you have backwards stability, you can bound the error of the computation as follows.

  1. Compute the maximum a local perturbation can affect the output (e.g. a difference causes at most a change in the output). The size of this difference is known as the condition of the problem.
  2. Use step 1 to bound the error of the algorithm by bounding how different the output to the approximate problem is versus the original problem.

This is known as backwards error analysis and is much simpler than the naive approach, where you try to compute the aggregate errors of individual floating point operations in an algorithm.

Definition 1.13.12

The condition number of a matrix is how sensitive it is to small perturbations. Suppose I have the equation , and there is some error , in . The condition number tells me the maximum ratio of relative error in to relative error in .

Where the last step came from choosing . You can also verify you get this same bound if there’s relative error in or in .

1.14. Kronecker Product

Definition 1.14.13

The Kronecker product is a block-matrix of the form

Definition 1.14.14
is the vector obtained by stacking the columns of .
Theorem 1.14.16

Proof: Observe that

If you do this for every , you get the appropriate Kronecker product.

This is a useful tool when you are solving for matrices, for example

2. Decompositions

2.1. Jordon Normal Form

Theorem 2.1.1
Every square matrix can be written as where is block-diagonal.

The “blocks” in the block diagonal correspond to repeated eigenvalues, and the only off-diagonal entries have value , are right above the diagonal and have an equal left and bottom child.

The block decomposition is so simple that we can analytically compute by using Taylor expansions for each block. Pretty awesome.

One big problem is it’s not numerically stable because it depends a lot on whether two eigenvalues are exactly equal or not. Hence, it’s often preferred to use the Schur Decomposition.

2.2. Schur Decomposition

Theorem 2.2.2
Every square matrix can be written as where is upper-triangular and is orthonormal.

Proof: This is almost the same argument used to show Symmetric matrices are diagonalizable. Let the matrix be . Find an eigenvalue. Construct an orthogonal basis for the corresponding eigenbasis. Extend the basis to cover the full space, and create a corresponding change of basis matrix . Now we have,

You can repeat the process recursively on .

2.3. QR Decomposition

Theorem 2.3.3
Any matrix can be written as , where is orthonormal.

Proof: Take an input a matrix . Normalize the first column. Now remove the component of the second column along the first column and normalize. Repeat this process until you’ve orthogonalized every column of . This is the matrix. By construction, the th column of can be written as a weighted sum of columns up to in , and hence we get .

This decomposition is handy for solving linear equations where the and so the inverse is not defined. Consider the case of . can be handled similarly.

The blocked representation works because only linearly independent vectors are needed to span the column space of . can be computed very efficiently via back-substitution.

This can also be used for computing determinants, since can be chosen to have determinant , hence

Where the last step follows from being triangular.

2.4. SVD

Theorem 2.4.4
Any matrix with rank can be written as where , , are orthonormal, and is diagonal and non-negative.

Proof: Since symmetric matrices are diagonalizable, we can write

Where is an orthonormal matrix and is a diagonal matrix. has all positive entries. This is because

Now let and suppose has rank . Then, only the first columns of have non-zero entries, and only the first rows of matter. Hence, we can write . Now let and let . By construction, .

Finally, observe is orthonormal. We use the fact that

This decomposition for is known as the reduced SVD. The entries of are known as the singular values, and the columns of and are known as the singular vectors.

2.5. LU Decomposition

Theorem 2.5.5

Any matrix can be written as

Where P is a permutation matrix, L is lower-triangular, and U is upper-triangular.

Proof: Essentially, we get this result from Gaussian elimination. First observe that when we left-multiply a matrix by a matrix , the entries of can be interpreted as “how much of each row” to include.

Now, in Gaussian elimination, we walk down the diagonal, and subtract off multiples of the current row until everything in the current column below our entry has been eliminated. This entire operation can be summarized like this.

Where and are chosen to zero out the entries beneath the pivot (). Call the matrix used at the th step . This can fail if the current diagonal entry is zero, but in that case we can just pivot: swap in a row with a non-zero entry. If there is no such row, then everything below the entry is already zero so you can keep going (note that in this case, the matrix is not invertible). This corresponds to multiplication by a permutation matrix, call the one we use at the th iteration .

At the end of Gaussian elimination, we’ll end up with an upper-triangular matrix , and the product of several row subtraction operations like above.

To see the re-arrangement step, we just do

It’s easily verified that the matrices only have their sub-diagonal entries re-arranged. As rough intuition for what happens, the left multiplications rearrange the rows, messing up the diagonals. The right multiplications rearrange columns which corrects the diagonal without touching the rest of the rows.

Now, each of the terms can be inverted by negating the sub-diagonal entries (just try it). Hence, we can write

Finally, the product of lower-triangular matrices is lower-triangular which gives the final result.

2.6. Cholesky Decomposition

Theorem 2.6.6

Any symmetric positive definite matrix can be written as

Where L is lower-triangular.

Proof: This uses the same idea of Gaussian elimination, except instead of zeroing out just the column we take advantage of the symmetry and also zero out the row.

Hence, we apply operations like

At the end we’ll end up with

From here doing the same inverse trick as Gaussian elimination gives the desired result. Now, it’s important to note that we did not need pivoting during this process. This is because the positive definite property guarantees that all diagonal entries are non-zero.

The simpler decomposition and lack of pivoting makes this decomposition very useful when its applicable.

3. Linear Routines

3.1. Richardson Iteration

Theorem 3.1.1

Suppose we want to solve . Computing inverses is expensive, and not easily done in parallel. Under certain conditions, we can instead run the following equation until we reach a fixed point.

Proof: Subtracting the true answer from both sides, and writing .

So, the error goes to zero regardless of our initialization if , i.e. for all eigenvalues of . If has both positive and negative eigenvalues, then no matter the choice of this condition is violated and there won’t be convergence. This means should not be indefinite. It’s also easy to see that if the s are tightly concentrated, then can be made quite small, allowing fast convergence.

As a sidenote, Richardson Iterations are equivalent to doing gradient descent on the following quadratic function.

The convergence condition above corresponds to ensuring the quadratic is bowl shaped instead of saddle shaped, which ensures the gradient consistently points towards or away from the flat part of the curve.

3.2. Preconditioning

Definition 3.2.1

Again we want to solve . Instead of doing it directly let’s first rewrite it as . Now let’s do this in two pieces.

This is known as the right-preconditioned system. You can also solve the left-preconditioned system.

Why might this be a good idea? Well, methods like the Richardson Iteration converge fast if the largest and smallest eigenvalues are close. might not have this property, but and might, given a good choice of .

For example, if you use a left-preconditioned system and set , then even if wasn’t positive semi-definite, is positive semi-definite, and Richardson iterations or conjugate gradient descent will converge swiftly.

3.3. Householder Triangularization

I won’t cover this super thoroughly, but it’s a cool idea so here’s the gist. Apply a series of orthogonal matrices to , such that you end up with a triangular matrix.

This is an alternative way to find a decomposition, which intriguingly is a bit slower but also numerically more stable.

Anyway, each orthogonal matrix will zero out the bottom of a column.

You do this by choosing to be a reflection matrix, where it reflects along the line of symmetry between it and the target vector.

To do this, first take the difference between the current vector and the target vector, normalize it, and call it . For a vector , the component in the direction of is . If you remove twice that component, you will have performed a reflection. Hence, you just choose to be .

3.4. Reduction to Hessenberg Form

Hessenberg form is like triangularizing a matrix, except we only eliminate zeros beneath the off-diagonal. The final matrix will look like this.

We can transform a matrix into a Hessenberg matrix as follows

  1. Using the same trick as Householder triangularization, find an orthogonal transformation that zeros everything below the off-diagonal.
  2. Set . You can verify the right multiplication will only change columns to right of the current column.

Now you could just left multiply , but doing both allows you to find a Hessenberg matrix is similar too, which is a first step in many other algorithms.

3.5. Rayleigh Quotient Iteration

Definition 3.5.2
The Rayleigh Quotient is defined as .
Lemma 3.5.1
The eigenvalues of are of the form

Proof:

Power Iteration
Definition 3.5.3
Power iteration is the process whereby we take , compute , then , etc. until some convergence criteria is met

It’s easy to see that if we normalize at the end, the vector will converge to the eigenvector with the largest eigenvalue. Combining this with the above lemma gives a way to use power iteration to generate a particular eigenvector of a matrix, provided we have a good estimate of the corresponding eigenvalue. We can then compute the Rayleigh Quotient of this new eigenvector to get a better estimate of the corresponding eigenvalue, and repeat until convergence.

3.6. QR Iteration

Theorem 3.6.2
Under certain conditions (where is linearly independent vectors) converges to the eigenvectors associated with the largest eigenvalues.

Proof: The idea is essentially power iteration but for many vectors. Define to be a matrix with linearly independent columns. Let be a symmetric real matrix with eigen-decomposition . Let be the matrix whose first columns of and be the top left subset of . Finally, let be an matrix representing some error term. Then…

We drop the term because we assume it’s smaller than all the eigenvalues chosen, hence its contribution drops off exponentially.

We conclude that has the same column space as . Now, repeat this argument over and over, where is the first column, the first two columns, etc. Since it converges to the same column space every time, it must be that the th column of converges to the eigenvector associated with the th largest eigenvalue.

In order for this argument to hold, must always be invertible. One way this can hold is if the first rows of are linearly independent. This is equivalent to saying all the leading principle submatrices are non-singular, or that has an LU decomposition.

In general though, you don’t need to worry about this, as projecting onto a space with rank less than only happens in adversarial cases. It’s similar to the odds of random 2D points being colinear.

This gives us a method to directly compute all the eigenvectors! However, we also know from regular power iteration that each of the vectors in is also converging to the eigenvector with the largest eigenvalue. Hence, this basis will be really ill-conditioned. The easy fix is to simply find an orthonormal basis for the current on every iteration, and use that instead.

That’s essentially it! It’s worth noting that usually you don’t bother having a separate , and instead just do

This is equivalent to simultaneous iteration.

The last step essentially specifies the same update rule as simultaneous iteration. It’s worth noting that the QR decomposition can be found in instead of time. The trick is to first reduce to Hessenberg form, and then you can use Householder reflectors which only touch two rows to zero out the sub-diagonal entries. The Hessenberg structure will remain throughout this process, via a similar argument used in reducing to Hessenberg form.

Also, it’s easy to see the diagonal entries of are Rayleigh quotients, and are converging rapidly towards the eigenvalues. You can use these estimates to further accelerate convergence of the QR algorithm (it’s called the shifted QR algorithm) but I won’t show that here.

3.7. Arnoldi Iteration

This is a similar process to Gram-Schmidt orthogonalization, but it’s used to compute a similar Hessenberg matrix. The main benefit over using Householder reflectors is that you can stop the process part way through (with Householder reflectors, you need to know every single reflector before you can conclude what the entries of and are). This is very desirable for high-dimensional problems, where a partial reduction suffices.

Let , with Hessenberg and Q orthonormal. Then,

If we restrict ourselves to the first columns of :

Where is an sized Hessenberg matrix. We need the first columns on the RHS because the Hessenberg matrix has sub-diagonal entries. This gives us a recurrence for the th column.

Algorithmically, think of this as plugging some term in on the left and extracting a new vector . You can compute for the first terms using projections, and for the last term it’s just the magnitude of the remaining vector. For the first term , vector you can choose any arbitrary .

This is a bit different from Gram-Schmidt orthogonalization. Gram-Schmidt took each column of , and subtracted out components. The Arnoldi iteration makes no attempt to use the columns of . Instead, it repeatedly invokes on the previous iterate to compute the next iterate. It’s easy to see the Arnoldi iteration computes a basis for the Krylov subspace (just stare at the recurrence).

This Hessenberg matrix also has an interesting interpretation.

Where is the same as with the bottom row chopped off. can be viewed as a projection matrix. Imagine I have some vector in the basis of , applying is equivalent to applying and then projecting back onto the Krylov subspace.

This Krylov basis () is interesting. Imagine writing a vector in this basis.

This gives Arnoldi iterations a connection to polynomials of matrices. You can prove that the characteristic polynomial of , minimizes

Over the space of monic degree polynomials.

As a quick ad-lib of what happens,

The last step you get by analyzing as a block matrix. It’s only necessary that the first column of since is by construction a multiple of . If any other polynomial (say ) satisfied these conditions, you could say which would imply is not full rank.

Let . The above result gives cool insights into the eigenvalues of (called the Ritz values).

  1. The Ritz values of are just shifted up by .
  2. The Ritz values of are just scaled by .

To see this, just think about how to change the roots of in order to keep as small as possible.

This also gives insights into why the Ritz values are good approximations of eigenvalues. Namely, imagine if wasn’t forced to be degree , and could be larger. Then you could choose to be the characteristic polynomial of , which would force by the Cayley-Hamilton theorem. Since is only of dimension , you cannot do that, but you could choose the zeros of to be close to the largest eigenvalues of . This zeros out the eigenspaces which typically contribute to most of the magnitude of . Of course, might not have much of a component in those dominant eigenspaces, so it may not be optimal to do this, but this nonetheless explains the tendency.

3.8. GMRES

GMRES aims to solve . It does this by finding to solve the following least-squares problem.

Since we can equivalently write the problem as

Where the last step comes from observing both terms are in the column space of , so “projecting” it via won’t change any norms. This is a typical least-squares problem, and can be solved in the usual way. So the algorithm GMRES proposes is…

  1. Run a step of the arnoldi iteration.
  2. Solve a least squares problem to compute , then compute an approximate solution to as .
  3. Go back to step 1.

In addition, it’s possible to reuse work in successive least-squares computations that brings down the overhead to linear time.

3.8.1. Convergence

This can again be analyzed as a polynomial approximation problem.

This time, we have and is a degree-n polynomial. How fast shrinks is determined by how small can get. If you assume is diagonalizable,

So essentially, GMRES converges fast if the condition number is small, and if you can find a polynomial that makes all the eigenvalues small. Now, complex analysis tells us that for any holomorphic function, the value at its interior is equal to the average of values on a ring around it. Hence, if the eigenvalues form a ring around 0 (where the polynomial evaluates to 1), then the convergence rate is very slow. If they’re tightly concentrated somewhere else, say at , then it’s easy to see a polynomial of the form reduces the error very quickly. The more spread out eigenvalues, the more unique roots you will need and the slower the overall convergence.

3.9. Bi-Orthogonalization Methods

These methods solve the same problems as before, and . However, unlike the Arnoldi iteration, they don’t construct a basis for that is orthonormal. Instead, they construct a basis that adheres to a 3-term recurrence.

They construct two such bases and force them to be mutually orthogonal.

This is very analogous to previous methods. You can use the eigenvalues of to estimate the eigenvalues of . You can choose or to be the closest vector to in the current Krylov subspace (residual minimization), etc.

The main advantage of these methods is they need only a 3-term recurrence, while the main disadvantage is their stability guarantees aren’t as good ( and need not be well-conditioned).

4. Applications

4.1. Closest Orthonormal Matrix

Suppose we want to find such that , where represents the Frobenium norm. Then we have

Define . Observe that is orthonormal as it is the product of orthonormal matrices.

The last statement is because no entry of an orthonormal matrix can exceed by normaity. Hence, it’s easy to see is the optimal choice.

This has a really cool application in robotics. Essentially, imagine I have a point cloud representing an object defined by . Suppose I know in it’s base reference frame, the point cloud looks like . Then computing my rotation is equivalent to solving the following problem.

You can reformulate this into a closest orthonormal matrix.

You can extend this to “closest rigid transform” by simply subtracting the centroids from both point sets and then computing this. The difference in centroids tells you the movement, tells you the rotation. This works great for things that naturally see the world in point clouds, like Lidar. SLAM (a modern pose estimation algorithm) essentially does this but with an initial guess of what points from the previous timestep correspond to points from the current timestep.

4.2. Muon

The main insight of Muon is gradient updates on weight matrices shouldn’t change the outputs too drastically. Specifically, it would be nice if your weight update satisfied

This is exactly equivalent to bounding the spectral norm of the matrix. is just a scaling parameter, so let’s just consider the family of matrices with spectral norm less then , . What matrix in is the “best” choice? One reasonable choice is the matrix which decreases the loss the most. In other words,

It’s common to apply gradient updates by subtracting them, so in practice we actually want

I showed in the closest orthonormal matrix section that the solution to this is obtained via , where and come from the SVD of the gradient.

You’ll notice the resulting weight update is orthonormal. There’s a good reason for that! First observe the is convex, meaning if , then .
Proof:

Thus, this problem boils down to optimizing a linear functional over a convex set. The optimal solution to any such problem will always be found at the vertices of the convex set. It’s easy to see that orthonormal matrices are vertices of as their spectral values are all ones, and you can’t write as a convex combination of two numbers where at least one of the two is smaller then .

If you actually compute , or apply it to vectors, you’ll notice it seems very different then the gradient. In what tangible sense is “close” to the gradient? Observe,

Hence, components in are mapped to . Now, the nice thing about it also does this! It messes up the scaling, but it keeps the input-output structure intact. If I change the input along one of the right singular vectors, the output still exclusively moves along one of the left singular vectors.

Anyways, the main problem with this idea is computing the SVD is expensive. Here’s how they got around it.

Definition 4.2.1

Theorem 4.2.1

Given the SVD of is

Proof:

It’s easy to extend this to any linear combination of odd powers will also commute with the SVD.

Anyway, this gives you a lot of power. Muon uses this insight to cheaply orthogonalize by choosing a matrix polynomial such that for any , namely .

4.3. PCA

We would like to approximate with a matrix, where . This is often useful because high-dimensional representations are harder to visualize and more computationally expensive to handle.

To make our problem more precise, we want to compress coordinates to coordinates. Then, we will grade our approximation based on how well the “decompressed” coordinates line up with the original coordinates. If we insist compression and decompression must be linear, it’s easy to see this is equivalent to the following problem.

Suppose the optimal spanned a rank- subspace . Let be the orthogonal projector onto .

Hence, choosing to be an orthogonal projector is always optimal. Now, we can analyze our original problem. Let per the SVD.

By Cauchy-Schwarz, choosing is optimal.

TODO: Producing samples with a given Covariance. Discretizing differential equations